Morphological knowledge and alignment of English-German parallel corpora
Abstract
Alignment is an important step in exploiting parallel corpora linguistically. In this paper we introduce a morphological component that improves the alignment of German-English parallel texts and helps find correspondences between morphological elements at the sub-word level. The paper deals with one small aspect of an alignment system, namely the improvement of a dictionary-based distance measure by means of a morphological analyser.
What is alignment?
For the purposes of this paper we define a bilingual parallel text as a text (L1) and its translation (L2). A sentence-level alignment then maps groups of L1 sentences to corresponding groups of L2 sentences. These groups are often called "beads". An alignment can be viewed as a sequence of beads that covers the entire parallel text. While most beads express the correspondence between a single L1 sentence and a single L2 sentence, other types of beads arise when the translator splits, merges, deletes, adds or reorders sentences. Each sentence belongs to exactly one bead. To illustrate some of the difficulties, consider the following excerpt from the very beginning of the ‘The War of the Worlds’ parallel text:
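Separately from the excerpt referred to above, the definition of beads can be made concrete with a minimal Python sketch. It is an illustration only, not the paper's implementation: the names `Bead`, `is_valid_alignment` and `bead_type`, the use of 0-based sentence indices, and the sample alignment are all assumptions introduced here. The sketch encodes each bead as a pair of index groups and checks the stated constraint that every sentence belongs to exactly one bead.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Bead:
    """One alignment unit: a group of L1 sentences mapped to a group of L2 sentences.

    Hypothetical representation: sentences are referred to by 0-based index.
    """
    l1: Tuple[int, ...]  # L1 sentence indices (empty if the translator added text)
    l2: Tuple[int, ...]  # L2 sentence indices (empty if the translator deleted text)


def is_valid_alignment(beads: List[Bead], n_l1: int, n_l2: int) -> bool:
    """Return True if every sentence on each side occurs in exactly one bead."""
    seen_l1 = sorted(i for b in beads for i in b.l1)
    seen_l2 = sorted(j for b in beads for j in b.l2)
    # Sorting makes the check independent of bead order, so reordered
    # sentences are still accepted as long as nothing is missing or duplicated.
    return seen_l1 == list(range(n_l1)) and seen_l2 == list(range(n_l2))


def bead_type(bead: Bead) -> str:
    """Describe a bead by its group sizes, e.g. '1:1' or '2:1'."""
    return f"{len(bead.l1)}:{len(bead.l2)}"


# Made-up example: 5 source sentences aligned to 4 target sentences.
alignment = [
    Bead((0,), (0,)),     # one sentence translated as one sentence
    Bead((1,), (1, 2)),   # one sentence split into two
    Bead((2, 3), (3,)),   # two sentences merged into one
    Bead((4,), ()),       # sentence deleted by the translator
]

print([bead_type(b) for b in alignment])
print(is_valid_alignment(alignment, n_l1=5, n_l2=4))
```

Representing a bead as two index tuples keeps splits, merges, deletions and additions in one uniform structure; a distance measure such as the dictionary-based one mentioned in the abstract would then score individual beads, which this sketch does not attempt.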
Similar resources
Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria
We present a new approach to aligning sentences in bilingual parallel corpora based on punctuation, especially for English and Chinese. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages, such as French-English or German-English, it does not work as well for parallel corpora that are noisy or written ...
Bilingual Sentence Alignment Based on Punctuation Marks
We present a new approach to aligning English and Chinese sentences in parallel corpora based solely on punctuation. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages such as French-English and German-English, it does not fare as well for parallel corpora that are noisy or written in two distant lan...
Minimally supervised lemmatization scheme induction through bilingual parallel corpora
We present a lemma induction scheme on a target language through minimally supervised alignment and transfer methods utilizing English-to-German parallel corpora. Compared to previous alignment and transfer approaches, the approach outlined here increases computational efficiency and significantly reduces the level of supervision necessary in inducing clusters of inflectional forms. Furthermore...
Projecting Temporal Annotations Across Languages
This thesis investigates the use of parallel corpora for the annotation of temporal objects and relations. In particular, we employ existing tools for the temporal analysis of English to annotate the English portion of an English-German bitext, and automatically project these annotations to the German text, guided by word alignment. Projection-based approaches to multilingual annotation have pr...
Automatic creation of WordNets from parallel corpora
In this paper we present the evaluation results for the creation of WordNets for five languages (Spanish, French, German, Italian and Portuguese) using an approach based on parallel corpora. We have used three very large parallel corpora for our experiments: DGT-TM, EMEA and ECB. The English part of each corpus is semantically tagged using Freeling and UKB. After this step, the process of WordN...